-
Notifications
You must be signed in to change notification settings - Fork 41
feat(trainer): Support Framework Labels in Runtimes #56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(trainer): Support Framework Labels in Runtimes #56
Conversation
|
@andreyvelich: GitHub didn't allow me to assign the following users: jskswamy, ned2, eoinfennessy, briangallagher. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
| __command: tuple[str, ...] = field(init=False, repr=False) | ||
|
|
||
| @property | ||
| def command(self) -> tuple[str, ...]: | ||
| return self.__command | ||
|
|
||
| def set_command(self, command: tuple[str, ...]): | ||
| self.__command = command |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I make this __command private, so it should not be accessed or modified by users.
| EXEC_FUNC_SCRIPT = textwrap.dedent( | ||
| """ | ||
| read -r -d '' SCRIPT << EOM\n | ||
| {func_code} | ||
| EOM | ||
| printf "%s" \"$SCRIPT\" > \"{func_file}\" | ||
| __ENTRYPOINT__ \"{func_file}\"""" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For runtimes with CustomTrainer, I insert the complete command when Runtime object is created:
sdk/python/kubeflow/trainer/utils/utils.py
Lines 155 to 164 in f3e9020
| # Set the Trainer entrypoint. | |
| if framework == types.TORCH_TUNE: | |
| trainer.set_command(constants.TORCHTUNE_COMMAND) | |
| elif ml_policy.torch: | |
| trainer.set_command(constants.TORCH_COMMAND) | |
| elif ml_policy.mpi: | |
| trainer.set_command(constants.MPI_COMMAND) | |
| else: | |
| trainer.set_command(constants.DEFAULT_COMMAND) | |
I think, this is much clear approach since we always have the final container command in runtime.trainer.__command.
| # The dict where key is the container image and value its representation. | ||
| # Each Trainer representation defines trainer parameters (e.g. type, framework, entrypoint). | ||
| # TODO (andreyvelich): We should allow user to overrides the default image names. | ||
| ALL_TRAINERS: Dict[str, RuntimeTrainer] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would keeping the existing mapping, but with the framework as key instead of the image, make things tidier / more structured rather than replacing that with individual strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we don't need it, do we ? Theoretically, users should be able to use any arbitrary framework they define in the runtime labels with CustomTrainer, even that we don't support upstream (e.g. scikit-learn, tf, etc.).
The important mapping is only for BuiltinTrainers, since we need to map the framework name and BuiltinTrainer configs:
sdk/python/kubeflow/trainer/types/types.py
Line 154 in f3e9020
| BuiltinTrainer.__annotations__["config"].__name__.lower().replace("config", "") |
If you want, we can create mapping for BuiltinTrainer configs, but for now we have only 1 config (e.g. TorchTune).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK that makes sense. It's more an internal implementation detail anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should verify that:
- CustomTrainer does not reference runtimes corresponding to BuiltinTrainer
- BuiltTrainer only reference corresponding runtimes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Thanks for creating this. I left my initial review for you.
| # The label key to identify ML framework that runtime uses (e.g. torch, deepspeed, torchtune, etc.) | ||
| RUNTIME_FRAMEWORK_LABEL = "trainer.kubeflow.org/framework" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also define TRAINER_TYPE_LABEL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't really need it as we discussed here: kubeflow/trainer#2761 (comment)
| func: Callable | ||
| func_args: Optional[Dict] = None | ||
| packages_to_install: Optional[List[str]] = None | ||
| packages_to_install: Optional[list[str]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we change from List to list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python 3.9 introduce builtin support for those types: https://docs.python.org/3/whatsnew/3.9.html#type-hinting-generics-in-standard-collections
|
|
||
|
|
||
| # Change it to list: BUILTIN_CONFIGS, once we support more Builtin Trainer configs. | ||
| TORCH_TUNE = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you prefer TORCHTUNE or TORCH_TUNE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe TORCH_TUNE is more accurate since we use camel case here: TorchTuneConfig.
| if isinstance(trainer, types.CustomTrainer): | ||
| trainer_crd = utils.get_trainer_crd_from_custom_trainer( | ||
| trainer, runtime | ||
| runtime, trainer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add validation here to ensure that users reference the right runtime, as we discussed here: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1753710956860929
| elif isinstance(trainer, types.BuiltinTrainer): | ||
| trainer_crd = utils.get_trainer_crd_from_builtin_trainer( | ||
| trainer, initializer | ||
| runtime, trainer, initializer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks for reminder!
|
/lgtm Thanks! |
Pull Request Test Coverage Report for Build 16706738665Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
d6603e6 to
033357e
Compare
|
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thank you!
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Fixes: #31, #29
I refactored SDK to support runtime labels and give users ability to override the default runtime images.
/assign @Electronic-Waste @astefanutti @kramaranya @jskswamy @ned2 @eoinfennessy @briangallagher